Goal: to predict which customers are more likely to purchase the newly introduced travel package.
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pandas_profiling  # note: this package was later renamed to ydata-profiling
sns.set(color_codes=True)
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import f1_score,accuracy_score, recall_score, precision_score, roc_auc_score, roc_curve, confusion_matrix, precision_recall_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier, RandomForestClassifier, StackingClassifier
from xgboost import XGBClassifier
# read data from excel file
data = pd.read_excel('Tourism.xlsx', sheet_name='Tourism')
# get columns
data.columns
Index(['CustomerID', 'ProdTaken', 'Age', 'TypeofContact', 'CityTier',
'DurationOfPitch', 'Occupation', 'Gender', 'NumberOfPersonVisiting',
'NumberOfFollowups', 'ProductPitched', 'PreferredPropertyStar',
'MaritalStatus', 'NumberOfTrips', 'Passport', 'PitchSatisfactionScore',
'OwnCar', 'NumberOfChildrenVisiting', 'Designation', 'MonthlyIncome'],
dtype='object')
# get size of dataset
data.shape
(4888, 20)
# check dataset information
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CustomerID                4888 non-null   int64
 1   ProdTaken                 4888 non-null   int64
 2   Age                       4662 non-null   float64
 3   TypeofContact             4863 non-null   object
 4   CityTier                  4888 non-null   int64
 5   DurationOfPitch           4637 non-null   float64
 6   Occupation                4888 non-null   object
 7   Gender                    4888 non-null   object
 8   NumberOfPersonVisiting    4888 non-null   int64
 9   NumberOfFollowups         4843 non-null   float64
 10  ProductPitched            4888 non-null   object
 11  PreferredPropertyStar     4862 non-null   float64
 12  MaritalStatus             4888 non-null   object
 13  NumberOfTrips             4748 non-null   float64
 14  Passport                  4888 non-null   int64
 15  PitchSatisfactionScore    4888 non-null   int64
 16  OwnCar                    4888 non-null   int64
 17  NumberOfChildrenVisiting  4822 non-null   float64
 18  Designation               4888 non-null   object
 19  MonthlyIncome             4655 non-null   float64
dtypes: float64(7), int64(7), object(6)
memory usage: 763.9+ KB
# check dataset missing values
total = data.isnull().sum().sort_values(ascending=False) # total number of null values
print(total)
DurationOfPitch             251
MonthlyIncome               233
Age                         226
NumberOfTrips               140
NumberOfChildrenVisiting     66
NumberOfFollowups            45
PreferredPropertyStar        26
TypeofContact                25
Designation                   0
OwnCar                        0
PitchSatisfactionScore        0
Passport                      0
CustomerID                    0
MaritalStatus                 0
ProdTaken                     0
NumberOfPersonVisiting        0
Gender                        0
Occupation                    0
CityTier                      0
ProductPitched                0
dtype: int64
# check for duplicates
data.duplicated().sum()
0
# check first rows of data
data.head()
| | CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 200000 | 1 | 41.0 | Self Enquiry | 3 | 6.0 | Salaried | Female | 3 | 3.0 | Deluxe | 3.0 | Single | 1.0 | 1 | 2 | 1 | 0.0 | Manager | 20993.0 |
| 1 | 200001 | 0 | 49.0 | Company Invited | 1 | 14.0 | Salaried | Male | 3 | 4.0 | Deluxe | 4.0 | Divorced | 2.0 | 0 | 3 | 1 | 2.0 | Manager | 20130.0 |
| 2 | 200002 | 1 | 37.0 | Self Enquiry | 1 | 8.0 | Free Lancer | Male | 3 | 4.0 | Basic | 3.0 | Single | 7.0 | 1 | 3 | 0 | 0.0 | Executive | 17090.0 |
| 3 | 200003 | 0 | 33.0 | Company Invited | 1 | 9.0 | Salaried | Female | 2 | 3.0 | Basic | 3.0 | Divorced | 2.0 | 1 | 5 | 1 | 1.0 | Executive | 17909.0 |
| 4 | 200004 | 0 | NaN | Self Enquiry | 1 | 8.0 | Small Business | Male | 2 | 3.0 | Basic | 4.0 | Divorced | 1.0 | 0 | 5 | 1 | 0.0 | Executive | 18468.0 |
data.tail()
| | CustomerID | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | Designation | MonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4883 | 204883 | 1 | 49.0 | Self Enquiry | 3 | 9.0 | Small Business | Male | 3 | 5.0 | Deluxe | 4.0 | Unmarried | 2.0 | 1 | 1 | 1 | 1.0 | Manager | 26576.0 |
| 4884 | 204884 | 1 | 28.0 | Company Invited | 1 | 31.0 | Salaried | Male | 4 | 5.0 | Basic | 3.0 | Single | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 21212.0 |
| 4885 | 204885 | 1 | 52.0 | Self Enquiry | 3 | 17.0 | Salaried | Female | 4 | 4.0 | Standard | 4.0 | Married | 7.0 | 0 | 1 | 1 | 3.0 | Senior Manager | 31820.0 |
| 4886 | 204886 | 1 | 19.0 | Self Enquiry | 3 | 16.0 | Small Business | Male | 3 | 4.0 | Basic | 3.0 | Single | 3.0 | 0 | 5 | 0 | 2.0 | Executive | 20289.0 |
| 4887 | 204887 | 1 | 36.0 | Self Enquiry | 1 | 14.0 | Salaried | Male | 4 | 4.0 | Basic | 4.0 | Unmarried | 3.0 | 1 | 3 | 1 | 2.0 | Executive | 24041.0 |
We can get a first statistical and descriptive analysis using pandas_profiling
# get pandas profiling report
pandas_profiling.ProfileReport(data)
The Pandas Profiling report shows some warnings and characteristics of the data.
# get stats for the columns
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CustomerID | 4888.0 | 202443.500000 | 1411.188388 | 200000.0 | 201221.75 | 202443.5 | 203665.25 | 204887.0 |
| ProdTaken | 4888.0 | 0.188216 | 0.390925 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Age | 4662.0 | 37.622265 | 9.316387 | 18.0 | 31.00 | 36.0 | 44.00 | 61.0 |
| CityTier | 4888.0 | 1.654255 | 0.916583 | 1.0 | 1.00 | 1.0 | 3.00 | 3.0 |
| DurationOfPitch | 4637.0 | 15.490835 | 8.519643 | 5.0 | 9.00 | 13.0 | 20.00 | 127.0 |
| NumberOfPersonVisiting | 4888.0 | 2.905074 | 0.724891 | 1.0 | 2.00 | 3.0 | 3.00 | 5.0 |
| NumberOfFollowups | 4843.0 | 3.708445 | 1.002509 | 1.0 | 3.00 | 4.0 | 4.00 | 6.0 |
| PreferredPropertyStar | 4862.0 | 3.581037 | 0.798009 | 3.0 | 3.00 | 3.0 | 4.00 | 5.0 |
| NumberOfTrips | 4748.0 | 3.236521 | 1.849019 | 1.0 | 2.00 | 3.0 | 4.00 | 22.0 |
| Passport | 4888.0 | 0.290917 | 0.454232 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
| PitchSatisfactionScore | 4888.0 | 3.078151 | 1.365792 | 1.0 | 2.00 | 3.0 | 4.00 | 5.0 |
| OwnCar | 4888.0 | 0.620295 | 0.485363 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| NumberOfChildrenVisiting | 4822.0 | 1.187267 | 0.857861 | 0.0 | 1.00 | 1.0 | 2.00 | 3.0 |
| MonthlyIncome | 4655.0 | 23619.853491 | 5380.698361 | 1000.0 | 20346.00 | 22347.0 | 25571.00 | 98678.0 |
(Univariate analysis plots were generated for each of the following columns: ProdTaken, Age, TypeofContact, CityTier, Occupation, Gender, NumberOfPersonVisiting, PreferredPropertyStar, MaritalStatus, NumberOfTrips, Passport, OwnCar, NumberOfChildrenVisiting, Designation, MonthlyIncome, PitchSatisfactionScore, ProductPitched, NumberOfFollowups, DurationOfPitch.)
We are going to perform bivariate analysis to understand the relationship between the columns
# Continuous columns + ProdTaken
con_col = ['ProdTaken', 'Age', 'DurationOfPitch', 'NumberOfTrips', 'MonthlyIncome']
# Categorical columns
cat_col = ['TypeofContact', 'CityTier', 'Occupation', 'Gender', 'NumberOfPersonVisiting', 'NumberOfFollowups',
'ProductPitched', 'PreferredPropertyStar', 'MaritalStatus', 'Passport', 'PitchSatisfactionScore',
'OwnCar', 'NumberOfChildrenVisiting', 'Designation']
# Pairplot for continuous columns
sns.pairplot(data[con_col], diag_kind='kde', hue='ProdTaken');
# Get correlation matrix for numeric variables
data[con_col].corr()
| | ProdTaken | Age | DurationOfPitch | NumberOfTrips | MonthlyIncome |
|---|---|---|---|---|---|
| ProdTaken | 1.000000 | -0.147254 | 0.078257 | 0.018898 | -0.130585 |
| Age | -0.147254 | 1.000000 | -0.012063 | 0.184905 | 0.464869 |
| DurationOfPitch | 0.078257 | -0.012063 | 1.000000 | 0.009715 | -0.006252 |
| NumberOfTrips | 0.018898 | 0.184905 | 0.009715 | 1.000000 | 0.139105 |
| MonthlyIncome | -0.130585 | 0.464869 | -0.006252 | 0.139105 | 1.000000 |
# Display correlation matrix in a heatmap
sns.heatmap(data[con_col].corr(), annot=True);
### Function to plot stacked bar charts for categorical columns
def stacked_plot(x, hue):
    # row-wise percentages: each category of x sums to 100 across hue values
    tab = 100 * pd.crosstab(x, hue, normalize='index').sort_values(by=hue[0])
    print(tab.T)
    tab.plot(kind='bar', stacked=True)
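The key step in stacked_plot is pd.crosstab with normalize='index', which converts counts to row-wise percentages; a minimal sketch on made-up data:

```python
import pandas as pd

# Hypothetical toy data standing in for a categorical column and ProdTaken
x = pd.Series(['A', 'A', 'B', 'B'], name='Segment')
y = pd.Series([0, 1, 0, 0], name='ProdTaken')

# normalize='index' makes each row sum to 1; scaling by 100 gives percentages
tab = 100 * pd.crosstab(x, y, normalize='index')
print(tab)  # A: 50/50, B: 100/0
```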
stacked_plot(data['TypeofContact'], data['ProdTaken'])
TypeofContact  Self Enquiry  Company Invited
ProdTaken
0                 82.375145        78.153629
1                 17.624855        21.846371
stacked_plot(data['CityTier'], data['ProdTaken'])
CityTier          1          2     3
ProdTaken
0          83.69906  76.767677  76.4
1          16.30094  23.232323  23.6
stacked_plot(data['Occupation'], data['ProdTaken'])
Occupation   Salaried  Small Business  Large Business  Free Lancer
ProdTaken
0           82.516892       81.573896        72.35023          0.0
1           17.483108       18.426104        27.64977        100.0
stacked_plot(data['Gender'], data['ProdTaken'])
Gender       Fe Male    Female       Male
ProdTaken
0          83.870968  82.55366  80.178326
1          16.129032  17.44634  19.821674
stacked_plot(data['NumberOfPersonVisiting'], data['ProdTaken'])
NumberOfPersonVisiting      1      5          4          2          3
ProdTaken
0                       100.0  100.0  81.189084  81.170663  80.849292
1                         0.0    0.0  18.810916  18.829337  19.150708
stacked_plot(data['NumberOfFollowups'], data['ProdTaken'])
NumberOfFollowups        2.0        1.0        3.0        4.0        5.0        6.0
ProdTaken
0                  89.519651  88.636364  83.356071  81.673114  75.130208  60.294118
1                  10.480349  11.363636  16.643929  18.326886  24.869792  39.705882
stacked_plot(data['Designation'], data['ProdTaken'])
Designation        AVP         VP    Manager  Senior Manager  Executive
ProdTaken
0            94.152047  91.304348  88.221709        83.28841  70.032573
1             5.847953   8.695652  11.778291        16.71159  29.967427
stacked_plot(data['PreferredPropertyStar'], data['ProdTaken'])
PreferredPropertyStar        3.0        4.0        5.0
ProdTaken
0                      83.895757  80.065717  73.849372
1                      16.104243  19.934283  26.150628
stacked_plot(data['MaritalStatus'], data['ProdTaken'])
MaritalStatus   Divorced    Married  Unmarried     Single
ProdTaken
0              86.947368  86.068376  75.659824  66.812227
1              13.052632  13.931624  24.340176  33.187773
stacked_plot(data['Passport'], data['ProdTaken'])
Passport           0          1
ProdTaken
0          87.709175  65.260197
1          12.290825  34.739803
stacked_plot(data['PitchSatisfactionScore'], data['ProdTaken'])
PitchSatisfactionScore          2          1          4          3          5
ProdTaken
0                       84.982935  84.713376  82.236842  78.619756  78.350515
1                       15.017065  15.286624  17.763158  21.380244  21.649485
stacked_plot(data['OwnCar'], data['ProdTaken'])
OwnCar             1          0
ProdTaken
0          81.530343  80.603448
1          18.469657  19.396552
stacked_plot(data['NumberOfChildrenVisiting'], data['ProdTaken'])
NumberOfChildrenVisiting        0.0        1.0        2.0        3.0
ProdTaken
0                         81.330869  81.153846  81.048689  79.692308
1                         18.669131  18.846154  18.951311  20.307692
sns.boxplot(x=data['ProdTaken'], y=data['Age']);
sns.boxplot(x=data['ProdTaken'], y=data['DurationOfPitch']);
sns.boxplot(x=data['ProdTaken'], y=data['NumberOfTrips']);
sns.boxplot(x=data['ProdTaken'], y=data['MonthlyIncome']);
From the previous exploratory data analysis, we can identify some characteristics of the customers who bought a package.
Now we are going to analyze, for each product, the characteristics of the customers who bought a package.
# data filtered by ProdTaken
data_prodtaken = data[data['ProdTaken']==1]
# pivot table counting the number of customers for each ProductPitched / Designation combination
pd.pivot_table(data, values='CustomerID', index=['ProductPitched'], columns=['Designation'], aggfunc='count')
| Designation | AVP | Executive | Manager | Senior Manager | VP |
|---|---|---|---|---|---|
| ProductPitched | |||||
| Basic | NaN | 1842.0 | NaN | NaN | NaN |
| Deluxe | NaN | NaN | 1732.0 | NaN | NaN |
| King | NaN | NaN | NaN | NaN | 230.0 |
| Standard | NaN | NaN | NaN | 742.0 | NaN |
| Super Deluxe | 342.0 | NaN | NaN | NaN | NaN |
stacked_plot(data['ProductPitched'], data_prodtaken['TypeofContact'])
ProductPitched   Super Deluxe      Basic     Deluxe   Standard   King
TypeofContact
Company Invited          80.0  35.336976  33.333333  25.806452    0.0
Self Enquiry             20.0  64.663024  66.666667  74.193548  100.0
stacked_plot(data['ProductPitched'], data['CityTier'])
ProductPitched      Basic       King  Super Deluxe   Standard     Deluxe
CityTier
1               79.587405  73.043478     61.988304  58.760108  52.424942
2                5.863192   9.565217      2.923977   2.425876   2.309469
3               14.549403  17.391304     35.087719  38.814016  45.265589
stacked_plot(data['ProductPitched'], data['Occupation'])
ProductPitched   Standard     Deluxe       King      Basic  Super Deluxe
Occupation
Free Lancer      0.000000   0.000000   0.000000   0.108578      0.000000
Large Business  11.320755   7.159353   5.217391  10.640608      5.263158
Salaried        45.552561  47.228637  49.565217  50.162866     50.877193
Small Business  43.126685  45.612009  45.217391  39.087948     43.859649
stacked_plot(data['ProductPitched'], data['Gender'])
ProductPitched   Standard       King      Basic     Deluxe  Super Deluxe
Gender
Fe Male          8.490566   0.000000   0.217155   4.792148      1.461988
Female          35.444744  35.652174  36.699240  37.009238     45.321637
Male            56.064690  64.347826  63.083605  58.198614     53.216374
stacked_plot(data['ProductPitched'], data['NumberOfPersonVisiting'])
ProductPitched          Super Deluxe   Standard     Deluxe      Basic       King
NumberOfPersonVisiting
1                           1.754386   0.943396   0.981524   0.488599   0.000000
2                          30.994152  28.167116  29.503464  28.067318  32.608696
3                          48.245614  48.921833  48.960739  49.457112  50.000000
4                          19.005848  21.832884  20.496536  21.932682  17.391304
5                           0.000000   0.134771   0.057737   0.054289   0.000000
stacked_plot(data['ProductPitched'], data['NumberOfFollowups'])
ProductPitched      Standard     Deluxe      Basic  Super Deluxe       King
NumberOfFollowups
1.0                 2.291105   3.823529   3.936577      5.263158   1.739130
2.0                 3.099730   5.058824   5.030071      6.432749   2.608696
3.0                29.784367  29.941176  30.399125     30.701754  32.608696
4.0                44.878706  42.588235  41.935484     42.105263  43.478261
5.0                17.924528  15.470588  15.746309     14.035088  15.652174
6.0                 2.021563   3.117647   2.952433      1.461988   3.913043
stacked_plot(data['ProductPitched'],data['PreferredPropertyStar'])
ProductPitched          Standard      Basic  Super Deluxe     Deluxe       King
PreferredPropertyStar
3.0                    58.839406  60.727865     61.988304  62.969382  66.183575
4.0                    17.543860  19.989136     18.128655  18.024263  19.806763
5.0                    23.616734  19.282998     19.883041  19.006355  14.009662
stacked_plot(data['ProductPitched'], data['MaritalStatus'])
ProductPitched   Standard     Deluxe       King  Super Deluxe      Basic
MaritalStatus
Divorced        19.137466  19.399538  22.608696     25.730994  18.023887
Married         51.212938  49.191686  54.782609     48.538012  44.299674
Single           6.738544  12.759815  22.608696     23.976608  27.741585
Unmarried       22.911051  18.648961   0.000000      1.754386   9.934853
stacked_plot(data['ProductPitched'], data['Passport'])
ProductPitched       King     Deluxe   Standard  Super Deluxe      Basic
Passport
0               73.913043  72.286374  71.698113     69.590643  69.163952
1               26.086957  27.713626  28.301887     30.409357  30.836048
stacked_plot(data['ProductPitched'], data['PitchSatisfactionScore'])
ProductPitched          Super Deluxe      Basic     Deluxe   Standard       King
PitchSatisfactionScore
1                          13.450292  19.435396  20.323326  19.676550  17.391304
2                           7.602339  11.074919  12.702079  13.746631  14.782609
3                          36.842105  29.858849  31.870670  27.223720  20.869565
4                          14.035088  21.172638  16.166282  19.407008  21.739130
5                          28.070175  18.458198  18.937644  19.946092  25.217391
stacked_plot(data['ProductPitched'], data['OwnCar'])
ProductPitched      Basic    Deluxe   Standard  Super Deluxe       King
OwnCar
0               41.150923  38.91455  34.770889     29.824561  27.826087
1               58.849077  61.08545  65.229111     70.175439  72.173913
stacked_plot(data['ProductPitched'], data['NumberOfChildrenVisiting'])
ProductPitched                 King      Basic     Deluxe   Standard  Super Deluxe
NumberOfChildrenVisiting
0.0                       14.832536  21.944595  22.588099  23.211876     28.000000
1.0                       42.583732  43.074416  43.558637  42.914980     42.000000
2.0                       34.449761  28.082564  27.383016  26.720648     24.666667
3.0                        8.133971   6.898425   6.470248   7.152497      5.333333
sns.boxplot(x=data['ProductPitched'], y=data['Age']);
sns.boxplot(x=data['ProductPitched'], y=data['DurationOfPitch']);
sns.boxplot(x=data['ProductPitched'], y=data['NumberOfTrips']);
sns.boxplot(x=data['ProductPitched'], y=data['MonthlyIncome']);
These are the main characteristics of the customers who bought each of the different packages.
# Drop CustomerID column
data.drop(['CustomerID'], axis=1, inplace=True)
# Drop Designation column
data.drop(['Designation'], axis=1, inplace=True)
# This function corrects 'Fe Male' values in the Gender column
def correct_gender(gender):
if gender == 'Fe Male':
gender = 'Female'
return gender
# apply correct_gender function to column 'Gender'
data['Gender'] = data['Gender'].apply(correct_gender)
# check Gender column
data['Gender'].unique()
array(['Female', 'Male'], dtype=object)
# This function converts "Unmarried" to "Single" in the MaritalStatus column
def convert_unmarried(mstatus):
if mstatus == 'Unmarried':
mstatus = 'Single'
return mstatus
# apply convert_unmarried function to column 'MaritalStatus'
data['MaritalStatus'] = data['MaritalStatus'].apply(convert_unmarried)
# check MaritalStatus column
data['MaritalStatus'].unique()
array(['Single', 'Divorced', 'Married'], dtype=object)
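Both category fixes could also be applied in a single pass with Series.replace; a small equivalent sketch (the mapping mirrors correct_gender and convert_unmarried above):

```python
import pandas as pd

# Toy series with the two problematic labels
s = pd.Series(['Fe Male', 'Female', 'Male', 'Unmarried', 'Married'])
fixed = s.replace({'Fe Male': 'Female', 'Unmarried': 'Single'})
print(fixed.tolist())  # ['Female', 'Female', 'Male', 'Single', 'Married']
```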
# convert columns to category column
data['TypeofContact']=data['TypeofContact'].astype('category')
data['Occupation']=data['Occupation'].astype('category')
data['Gender']=data['Gender'].astype('category')
data['MaritalStatus']=data['MaritalStatus'].astype('category')
data['ProductPitched']=data['ProductPitched'].astype('category')
# convert columns to numeric column
data['DurationOfPitch']=data['DurationOfPitch'].astype('float')
# check data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 18 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4888 non-null   int64
 1   Age                       4662 non-null   float64
 2   TypeofContact             4863 non-null   category
 3   CityTier                  4888 non-null   int64
 4   DurationOfPitch           4637 non-null   float64
 5   Occupation                4888 non-null   category
 6   Gender                    4888 non-null   category
 7   NumberOfPersonVisiting    4888 non-null   int64
 8   NumberOfFollowups         4843 non-null   float64
 9   ProductPitched            4888 non-null   category
 10  PreferredPropertyStar     4862 non-null   float64
 11  MaritalStatus             4888 non-null   category
 12  NumberOfTrips             4748 non-null   float64
 13  Passport                  4888 non-null   int64
 14  PitchSatisfactionScore    4888 non-null   int64
 15  OwnCar                    4888 non-null   int64
 16  NumberOfChildrenVisiting  4822 non-null   float64
 17  MonthlyIncome             4655 non-null   float64
dtypes: category(5), float64(7), int64(6)
memory usage: 521.2 KB
# check dataset missing values
total = data.isnull().sum().sort_values(ascending=False) # total number of null values
print(total)
DurationOfPitch             251
MonthlyIncome               233
Age                         226
NumberOfTrips               140
NumberOfChildrenVisiting     66
NumberOfFollowups            45
PreferredPropertyStar        26
TypeofContact                25
OwnCar                        0
PitchSatisfactionScore        0
Passport                      0
ProdTaken                     0
MaritalStatus                 0
NumberOfPersonVisiting        0
Gender                        0
Occupation                    0
CityTier                      0
ProductPitched                0
dtype: int64
Of the columns with missing data, TypeofContact is categorical, while DurationOfPitch, MonthlyIncome, Age, NumberOfTrips, NumberOfChildrenVisiting, NumberOfFollowups, and PreferredPropertyStar are numerical.
# counting the number of missing values per row
num_missing = data.isnull().sum(axis=1)
num_missing.value_counts()
0    4128
1     533
2     202
3      25
dtype: int64
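The per-row count above comes from isnull().sum(axis=1); a minimal illustration on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 has 1 NaN, row 1 has 2, row 2 has 0
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                    'b': [np.nan, np.nan, 6.0]})
per_row = toy.isnull().sum(axis=1)  # NaN count per row
print(per_row.value_counts())
```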
We are going to analyze if there is a pattern for the 25 rows with 3 missing values
data[num_missing == 3]
| | ProdTaken | Age | TypeofContact | CityTier | DurationOfPitch | Occupation | Gender | NumberOfPersonVisiting | NumberOfFollowups | ProductPitched | PreferredPropertyStar | MaritalStatus | NumberOfTrips | Passport | PitchSatisfactionScore | OwnCar | NumberOfChildrenVisiting | MonthlyIncome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 224 | 0 | 31.0 | NaN | 1 | NaN | Small Business | Male | 2 | 5.0 | Deluxe | 3.0 | Divorced | 1.0 | 0 | 3 | 1 | 0.0 | NaN |
| 571 | 0 | 26.0 | NaN | 1 | NaN | Salaried | Female | 3 | 5.0 | Basic | 3.0 | Married | 4.0 | 0 | 4 | 1 | 2.0 | NaN |
| 572 | 0 | 29.0 | NaN | 1 | NaN | Small Business | Female | 3 | 3.0 | Deluxe | 3.0 | Divorced | 5.0 | 0 | 2 | 1 | 0.0 | NaN |
| 576 | 0 | 27.0 | NaN | 3 | NaN | Small Business | Male | 2 | 3.0 | Deluxe | 3.0 | Divorced | 1.0 | 0 | 3 | 0 | 1.0 | NaN |
| 579 | 0 | 34.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | Basic | 5.0 | Single | 2.0 | 0 | 2 | 1 | 1.0 | NaN |
| 598 | 1 | 28.0 | NaN | 1 | NaN | Small Business | Male | 2 | 3.0 | Basic | 3.0 | Single | 7.0 | 0 | 3 | 0 | 0.0 | NaN |
| 622 | 0 | 32.0 | NaN | 3 | NaN | Salaried | Male | 3 | 3.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 2 | 0 | 0.0 | NaN |
| 724 | 0 | 24.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | Deluxe | 3.0 | Married | 2.0 | 0 | 3 | 1 | 1.0 | NaN |
| 843 | 0 | 26.0 | NaN | 1 | NaN | Small Business | Male | 2 | 1.0 | Basic | 3.0 | Divorced | 2.0 | 0 | 5 | 1 | 1.0 | NaN |
| 1021 | 1 | 25.0 | NaN | 3 | NaN | Salaried | Male | 3 | 4.0 | Basic | 5.0 | Divorced | 4.0 | 0 | 1 | 1 | 0.0 | NaN |
| 1047 | 0 | 33.0 | NaN | 3 | NaN | Small Business | Male | 2 | 3.0 | Deluxe | 5.0 | Divorced | 1.0 | 0 | 3 | 0 | 0.0 | NaN |
| 1143 | 0 | 45.0 | NaN | 3 | NaN | Small Business | Male | 2 | 4.0 | Deluxe | 5.0 | Married | 2.0 | 0 | 3 | 0 | 0.0 | NaN |
| 1182 | 0 | 36.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | Deluxe | 3.0 | Married | 1.0 | 0 | 5 | 1 | 1.0 | NaN |
| 1217 | 0 | 24.0 | NaN | 1 | NaN | Small Business | Male | 3 | 1.0 | Basic | 3.0 | Married | 2.0 | 0 | 1 | 0 | 0.0 | NaN |
| 1356 | 0 | 41.0 | NaN | 3 | NaN | Small Business | Female | 2 | 3.0 | Deluxe | 4.0 | Married | 6.0 | 0 | 3 | 1 | 1.0 | NaN |
| 1469 | 0 | 34.0 | NaN | 1 | NaN | Small Business | Male | 2 | 1.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 3 | 0 | 1.0 | NaN |
| 1694 | 0 | 31.0 | NaN | 1 | NaN | Small Business | Male | 2 | 5.0 | Deluxe | 3.0 | Married | 1.0 | 0 | 3 | 0 | 0.0 | NaN |
| 2041 | 0 | 26.0 | NaN | 1 | NaN | Salaried | Female | 3 | 5.0 | Basic | 3.0 | Married | 4.0 | 0 | 4 | 1 | 0.0 | NaN |
| 2042 | 0 | 29.0 | NaN | 1 | NaN | Small Business | Female | 3 | 3.0 | Deluxe | 3.0 | Married | 5.0 | 0 | 1 | 0 | 1.0 | NaN |
| 2046 | 0 | 27.0 | NaN | 3 | NaN | Small Business | Male | 2 | 3.0 | Deluxe | 3.0 | Married | 1.0 | 0 | 3 | 1 | 1.0 | NaN |
| 2049 | 0 | 34.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | Basic | 5.0 | Single | 2.0 | 0 | 1 | 1 | 0.0 | NaN |
| 2068 | 1 | 28.0 | NaN | 1 | NaN | Small Business | Male | 2 | 3.0 | Basic | 3.0 | Single | 7.0 | 0 | 3 | 1 | 1.0 | NaN |
| 2092 | 0 | 32.0 | NaN | 3 | NaN | Salaried | Male | 3 | 3.0 | Deluxe | 3.0 | Married | 3.0 | 0 | 1 | 0 | 2.0 | NaN |
| 2194 | 0 | 24.0 | NaN | 1 | NaN | Small Business | Female | 2 | 4.0 | Deluxe | 3.0 | Married | 2.0 | 0 | 3 | 0 | 0.0 | NaN |
| 2313 | 0 | 26.0 | NaN | 1 | NaN | Small Business | Male | 2 | 1.0 | Basic | 3.0 | Married | 2.0 | 0 | 5 | 1 | 1.0 | NaN |
For these 25 rows, the 3 missing columns are: TypeofContact, DurationOfPitch and MonthlyIncome
Now, we are going to get the columns with missing values
for n in num_missing.value_counts().sort_index().index:
if n > 0:
print(f'Rows with exactly {n} missing values, NAs are found in:')
n_miss_per_col = data[num_missing == n].isnull().sum()
print(n_miss_per_col[n_miss_per_col > 0])
print('\n')
Rows with exactly 1 missing values, NAs are found in:
Age                          96
DurationOfPitch             154
NumberOfFollowups            45
PreferredPropertyStar        26
NumberOfTrips               140
NumberOfChildrenVisiting     66
MonthlyIncome                 6
dtype: int64

Rows with exactly 2 missing values, NAs are found in:
Age                130
DurationOfPitch     72
MonthlyIncome      202
dtype: int64

Rows with exactly 3 missing values, NAs are found in:
TypeofContact      25
DurationOfPitch    25
MonthlyIncome      25
dtype: int64
# load KNNImputer
from sklearn.impute import KNNImputer
imputer = KNNImputer()
# create data set with only numeric columns
data_n = data.select_dtypes(include=np.number)
data_n_cols = data_n.columns.tolist()
data_n.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4888 non-null   int64
 1   Age                       4662 non-null   float64
 2   CityTier                  4888 non-null   int64
 3   DurationOfPitch           4637 non-null   float64
 4   NumberOfPersonVisiting    4888 non-null   int64
 5   NumberOfFollowups         4843 non-null   float64
 6   PreferredPropertyStar     4862 non-null   float64
 7   NumberOfTrips             4748 non-null   float64
 8   Passport                  4888 non-null   int64
 9   PitchSatisfactionScore    4888 non-null   int64
 10  OwnCar                    4888 non-null   int64
 11  NumberOfChildrenVisiting  4822 non-null   float64
 12  MonthlyIncome             4655 non-null   float64
dtypes: float64(7), int64(6)
memory usage: 496.6 KB
# impute missing values with KNNImputer
data_n = pd.DataFrame(imputer.fit_transform(data_n), columns=data_n_cols)
data_n.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4888 non-null   float64
 1   Age                       4888 non-null   float64
 2   CityTier                  4888 non-null   float64
 3   DurationOfPitch           4888 non-null   float64
 4   NumberOfPersonVisiting    4888 non-null   float64
 5   NumberOfFollowups         4888 non-null   float64
 6   PreferredPropertyStar     4888 non-null   float64
 7   NumberOfTrips             4888 non-null   float64
 8   Passport                  4888 non-null   float64
 9   PitchSatisfactionScore    4888 non-null   float64
 10  OwnCar                    4888 non-null   float64
 11  NumberOfChildrenVisiting  4888 non-null   float64
 12  MonthlyIncome             4888 non-null   float64
dtypes: float64(13)
memory usage: 496.6 KB
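To make the imputation concrete: KNNImputer replaces each missing entry with the mean of that feature over the k nearest rows, measured with a NaN-aware Euclidean distance (default n_neighbors=5). A toy sketch with made-up numbers:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame: row 2 is missing 'b'; its two nearest rows by 'a'
# (a=2 and a=4) supply b values 20 and 40
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                    'b': [10.0, 20.0, np.nan, 40.0]})
imp = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imp.fit_transform(toy), columns=toy.columns)
print(filled.loc[2, 'b'])  # 30.0, the mean of the two neighbors
```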
# replace columns with new imputed columns
for feature in data_n_cols:
data[feature] = data_n[feature]
# impute TypeofContact with its most frequent value (mode imputation)
data['TypeofContact'] = data['TypeofContact'].fillna('Self Enquiry')
# Check there are not missing values
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 18 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   ProdTaken                 4888 non-null   float64
 1   Age                       4888 non-null   float64
 2   TypeofContact             4888 non-null   category
 3   CityTier                  4888 non-null   float64
 4   DurationOfPitch           4888 non-null   float64
 5   Occupation                4888 non-null   category
 6   Gender                    4888 non-null   category
 7   NumberOfPersonVisiting    4888 non-null   float64
 8   NumberOfFollowups         4888 non-null   float64
 9   ProductPitched            4888 non-null   category
 10  PreferredPropertyStar     4888 non-null   float64
 11  MaritalStatus             4888 non-null   category
 12  NumberOfTrips             4888 non-null   float64
 13  Passport                  4888 non-null   float64
 14  PitchSatisfactionScore    4888 non-null   float64
 15  OwnCar                    4888 non-null   float64
 16  NumberOfChildrenVisiting  4888 non-null   float64
 17  MonthlyIncome             4888 non-null   float64
dtypes: category(5), float64(13)
memory usage: 521.2 KB
We are going to analyze the outliers in DurationOfPitch, MonthlyIncome and NumberOfTrips columns
# This function creates a combined boxplot and histogram for a series
def histboxplot(feature):
    # creating the 2 subplots
    f2, (ax_box1, ax_hist1) = plt.subplots(nrows = 2,        # 2 rows
                                           ncols = 1,        # 1 column
                                           sharex = 'col',   # x-axis shared among columns
                                           figsize = (7,6),
                                           gridspec_kw = {"height_ratios": (.25, .75)})
    sns.boxplot(x=feature, ax=ax_box1, showmeans=True, color='violet')  # boxplot
    sns.histplot(feature, kde=True, ax=ax_hist1)  # histogram (distplot is deprecated)
    ax_hist1.axvline(np.mean(feature), color='green', linestyle='--')   # mean
    ax_hist1.axvline(np.median(feature), color='black', linestyle='-')  # median
# This function treats the outliers in a variable by clipping them
def treat_outliers(feature, lower, upper):
    # values smaller than lower are set to lower,
    # values greater than upper are set to upper
    return np.clip(feature, lower, upper)
histboxplot(data['DurationOfPitch'])
DurationOfPitch has several outliers. All values above 37 are going to be clipped
data['DurationOfPitch']=np.clip(data['DurationOfPitch'], 0, 37)
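The 37 cutoff corresponds roughly to the boxplot's upper whisker, Q3 + 1.5·IQR. Using the DurationOfPitch quartiles reported by describe() above (Q1 = 9, Q3 = 20):

```python
# quartiles taken from data.describe() above
q1, q3 = 9.0, 20.0
iqr = q3 - q1
upper_whisker = q3 + 1.5 * iqr
print(upper_whisker)  # 36.5, hence clipping at ~37
```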
histboxplot(data['MonthlyIncome'])
MonthlyIncome has several outliers. All values below 15000 and above 40000 are going to be clipped
data['MonthlyIncome']=np.clip(data['MonthlyIncome'], 15000, 40000)
histboxplot(data['NumberOfTrips'])
NumberOfTrips has values above 10. However, it is possible for a customer to have more than 10 trips in a year. Therefore, we are not going to modify these values
# create independent variables
X = data.drop(['ProdTaken'], axis=1)
# create dependent variable
Y = data['ProdTaken']
# hot encoding for categorical variables
X = pd.get_dummies(X,drop_first=True)
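pd.get_dummies with drop_first=True one-hot encodes each categorical column and drops the first level, which is implied when all remaining dummies are 0. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Female', 'Male', 'Female']})
dummies = pd.get_dummies(df, drop_first=True)
# only Gender_Male remains; 'Female' is encoded as Gender_Male == 0
print(dummies.columns.tolist())  # ['Gender_Male']
```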
#Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.30, random_state = 1, stratify=Y)
print(f'Shape of Training set: {X_train.shape}')
print(f'Shape of Test set: {X_test.shape}')
print(f'Percentage of classes in Training set\n{y_train.value_counts(normalize=True)}')
print(f'Percentage of classes in Test set\n{y_test.value_counts(normalize=True)}')
Shape of Training set: (3421, 23)
Shape of Test set: (1467, 23)
Percentage of classes in Training set
0.0    0.811751
1.0    0.188249
Name: ProdTaken, dtype: float64
Percentage of classes in Test set
0.0    0.811861
1.0    0.188139
Name: ProdTaken, dtype: float64
Both the training set and the test set have similar class ratios for ProdTaken.
True Positives: customers predicted to purchase the package who actually purchase it.
True Negatives: customers predicted not to purchase who indeed do not.
False Positives: customers predicted to purchase who do not; follow-up and marketing effort is wasted on them.
False Negatives: customers predicted not to purchase who actually would have; these are lost sales for Visit with us, and the business would not grow.
F1-score should be used as the measure of model performance: a high F1-score implies both low False Negatives and low False Positives.
def metrics_score(model,train,test,train_y,test_y,threshold=0.5,model_name=''):
'''
Function to calculate different metric scores of the model - Accuracy, Recall, Precision, and F1 score
model: classifier to predict values of X
train, test: Independent features
train_y,test_y: Dependent variable
    threshold: threshold for classifying the observation as 1
    model_name: name of model
'''
pred_train = (model.predict_proba(train)[:,1]>threshold)
pred_test = (model.predict_proba(test)[:,1]>threshold)
score_dict = {'Model':model_name,
'Accuracy on training set' : accuracy_score(pred_train,train_y),
'Accuracy on test set': accuracy_score(pred_test,test_y),
'Recall on training set': recall_score(train_y,pred_train),
'Recall on test set': recall_score(test_y,pred_test),
'Precision on training set': precision_score(train_y,pred_train),
'Precision on test set': precision_score(test_y,pred_test),
'F1 on training set': f1_score(train_y,pred_train),
'F1 on test set': f1_score(test_y,pred_test)
}
return score_dict # returning dictionary with scores
def make_confusion_matrix(model,test_X,y_actual,threshold=0.5,labels=[1, 0]):
'''
model : classifier to predict values of X
test_X: test set
y_actual : ground truth
    threshold: threshold for classifying the observation as 1
'''
y_predict = (model.predict_proba(test_X)[:, 1] > threshold).astype('float')
cm=metrics.confusion_matrix( y_actual, y_predict, labels=[1,0])
df_cm = pd.DataFrame(cm,
index = ['Actual - Prod Taken','Actual - No Prod Taken'],
columns = ['Predicted-Prod Taken','Predicted-No Prod Taken'])
group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
group_percentages = ["{0:.1%}".format(value) for value in cm.flatten()/np.sum(cm)]
group_labels = ['(TP)', '(FN)', '(FP)', '(TN)']
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_counts,group_percentages,group_labels)]
labels = np.asarray(labels).reshape(2,2)
plt.figure(figsize = (8,6))
sns.heatmap(df_cm, annot=labels, fmt='')
plt.ylabel('True label')
plt.xlabel('Predicted label')
model_dt = DecisionTreeClassifier(criterion="gini", class_weight={0: 0.18, 1: 0.82}, random_state=1)
model_dt.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.18, 1: 0.82}, random_state=1)
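The hand-picked weights `{0: 0.18, 1: 0.82}` appear to mirror the class distribution of `ProdTaken` (roughly 18% positives), giving the minority class more influence — similar in spirit to `class_weight='balanced'`. A sketch of the balanced heuristic on illustrative counts (the exact class counts here are an assumption, not taken from the data):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative target with ~18% positives (assumed, not the actual counts)
y = np.array([0] * 82 + [1] * 18)

# 'balanced' weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights.round(3))))  # minority class weighted higher
```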
# checking model performances for this model
scores = metrics_score(model_dt,X_train,X_test,y_train,y_test,model_name='Decision Tree')
scores = pd.DataFrame(scores,index=[0])
scores
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 1.0 | 0.884117 | 1.0 | 0.702899 | 1.0 | 0.687943 | 1.0 | 0.695341 |
# creating confusion matrix
make_confusion_matrix(model_dt,X_test,y_test)
model_bagging = BaggingClassifier(random_state=1)
model_bagging.fit(X_train,y_train)
BaggingClassifier(random_state=1)
# checking model performances for this model
scores_bagging = metrics_score(model_bagging,X_train,X_test,y_train,y_test,model_name='Bagging Classifier')
scores = pd.concat([scores, pd.DataFrame([scores_bagging])], ignore_index=True)  # DataFrame.append was removed in pandas 2.x
scores
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 1.000000 | 0.884117 | 1.000000 | 0.702899 | 1.0 | 0.687943 | 1.000000 | 0.695341 |
| 1 | Bagging Classifier | 0.995323 | 0.912747 | 0.975155 | 0.597826 | 1.0 | 0.906593 | 0.987421 | 0.720524 |
# creating confusion matrix
make_confusion_matrix(model_bagging,X_test,y_test)
model_bagging_dt = BaggingClassifier(
base_estimator=DecisionTreeClassifier(criterion='gini',class_weight={0:0.18,1:0.82},random_state=1),
random_state=1)
model_bagging_dt.fit(X_train,y_train)
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.18,
1: 0.82},
random_state=1),
random_state=1)
# checking model performances for this model
scores_bagging_dt = metrics_score(model_bagging_dt,X_train,X_test,y_train,y_test,model_name='Bagging Classifier with Weights')
scores = pd.concat([scores, pd.DataFrame([scores_bagging_dt])], ignore_index=True)
scores
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 1.000000 | 0.884117 | 1.000000 | 0.702899 | 1.0 | 0.687943 | 1.000000 | 0.695341 |
| 1 | Bagging Classifier | 0.995323 | 0.912747 | 0.975155 | 0.597826 | 1.0 | 0.906593 | 0.987421 | 0.720524 |
| 2 | Bagging Classifier with Weights | 0.994154 | 0.901159 | 0.968944 | 0.525362 | 1.0 | 0.911950 | 0.984227 | 0.666667 |
# creating confusion matrix
make_confusion_matrix(model_bagging_dt,X_test,y_test)
model_random_forest = RandomForestClassifier(random_state=1)
model_random_forest.fit(X_train,y_train)
RandomForestClassifier(random_state=1)
# checking model performances for this model
scores_random_forest = metrics_score(model_random_forest,X_train,X_test,y_train,y_test,model_name='Random Forest')
scores = pd.concat([scores, pd.DataFrame([scores_random_forest])], ignore_index=True)
scores
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 1.000000 | 0.884117 | 1.000000 | 0.702899 | 1.0 | 0.687943 | 1.000000 | 0.695341 |
| 1 | Bagging Classifier | 0.995323 | 0.912747 | 0.975155 | 0.597826 | 1.0 | 0.906593 | 0.987421 | 0.720524 |
| 2 | Bagging Classifier with Weights | 0.994154 | 0.901159 | 0.968944 | 0.525362 | 1.0 | 0.911950 | 0.984227 | 0.666667 |
| 3 | Random Forest | 1.000000 | 0.909339 | 1.000000 | 0.547101 | 1.0 | 0.949686 | 1.000000 | 0.694253 |
# creating confusion matrix
make_confusion_matrix(model_random_forest,X_test,y_test)
model_random_forest_wt = RandomForestClassifier(class_weight={0:0.18,1:0.82}, random_state=1)
model_random_forest_wt.fit(X_train,y_train)
RandomForestClassifier(class_weight={0: 0.18, 1: 0.82}, random_state=1)
# checking model performances for this model
scores_random_forest_wt = metrics_score(model_random_forest_wt,X_train,X_test,y_train,y_test,model_name='Random Forest with Weights')
scores = pd.concat([scores, pd.DataFrame([scores_random_forest_wt])], ignore_index=True)
scores
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 1.000000 | 0.884117 | 1.000000 | 0.702899 | 1.0 | 0.687943 | 1.000000 | 0.695341 |
| 1 | Bagging Classifier | 0.995323 | 0.912747 | 0.975155 | 0.597826 | 1.0 | 0.906593 | 0.987421 | 0.720524 |
| 2 | Bagging Classifier with Weights | 0.994154 | 0.901159 | 0.968944 | 0.525362 | 1.0 | 0.911950 | 0.984227 | 0.666667 |
| 3 | Random Forest | 1.000000 | 0.909339 | 1.000000 | 0.547101 | 1.0 | 0.949686 | 1.000000 | 0.694253 |
| 4 | Random Forest with Weights | 1.000000 | 0.906612 | 1.000000 | 0.521739 | 1.0 | 0.966443 | 1.000000 | 0.677647 |
# creating confusion matrix
make_confusion_matrix(model_random_forest_wt,X_test,y_test)
# Print scores for all models
scores
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 1.000000 | 0.884117 | 1.000000 | 0.702899 | 1.0 | 0.687943 | 1.000000 | 0.695341 |
| 1 | Bagging Classifier | 0.995323 | 0.912747 | 0.975155 | 0.597826 | 1.0 | 0.906593 | 0.987421 | 0.720524 |
| 2 | Bagging Classifier with Weights | 0.994154 | 0.901159 | 0.968944 | 0.525362 | 1.0 | 0.911950 | 0.984227 | 0.666667 |
| 3 | Random Forest | 1.000000 | 0.909339 | 1.000000 | 0.547101 | 1.0 | 0.949686 | 1.000000 | 0.694253 |
| 4 | Random Forest with Weights | 1.000000 | 0.906612 | 1.000000 | 0.521739 | 1.0 | 0.966443 | 1.000000 | 0.677647 |
We will use the f1-score as the performance metric, with the goal of improving recall without considerably reducing precision.
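As a quick sanity check on the metric (illustrative arithmetic only): f1 is the harmonic mean of precision and recall, so it rewards recall gains only when precision does not collapse. Plugging in roughly the Decision Tree's test-set values:

```python
# Harmonic mean of precision and recall (approximate Decision Tree test values)
precision, recall = 0.69, 0.70
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # -> 0.695, matching the table above
```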
Next, we will reduce the complexity of the Decision Tree using cost-complexity pruning, identifying the optimal ccp_alpha parameter.
# Define Decision Tree and identify pairs of ccp_alphas and impurities
clf = DecisionTreeClassifier(random_state=1, class_weight={0: 0.18, 1: 0.82})
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
# Graph of Total Impurity vs effective alpha for training set
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
We now train one decision tree for each of the effective alphas
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha, class_weight={0: 0.18, 1: 0.82})
clf.fit(X_train, y_train)
clfs.append(clf)
# Graphs of Number of Nodes and Depth of tree vs alpha
clfs = clfs[:-1]          # drop the last element: the trivial single-node tree
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
The plots above show how the number of nodes and the depth of the tree decrease as alpha increases. Next, we look at how the f1-score varies with alpha.
# Get f1-scores for Train set
f1_train = []
for clf in clfs:
pred_train3 = clf.predict(X_train)
values_train = metrics.f1_score(y_train, pred_train3)
f1_train.append(values_train)
# Get f1-scores for Test set
f1_test = []
for clf in clfs:
pred_test3 = clf.predict(X_test)
values_test = metrics.f1_score(y_test, pred_test3)
f1_test.append(values_test)
# Accuracy scores via clf.score, kept for reference (the plot below uses the f1-scores)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
# Plot f1-Scores vs alpha
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("f1-score")
ax.set_title("f1-score vs alpha for training and testing sets")
ax.plot(
ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# Select the tree with the highest f1-score on the test set
index_best_model = np.argmax(f1_test)
dt_estimator = clfs[index_best_model]
# Train the model
dt_estimator.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.00032312936156255617,
class_weight={0: 0.18, 1: 0.82}, random_state=1)
# checking model performances for this model
scores_dt_estimator = metrics_score(dt_estimator,X_train,X_test,y_train,y_test,model_name='Tuned Decision Tree')
scores = pd.concat([scores, pd.DataFrame([scores_dt_estimator])], ignore_index=True)
scores
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 1.000000 | 0.884117 | 1.000000 | 0.702899 | 1.00000 | 0.687943 | 1.000000 | 0.695341 |
| 1 | Bagging Classifier | 0.995323 | 0.912747 | 0.975155 | 0.597826 | 1.00000 | 0.906593 | 0.987421 | 0.720524 |
| 2 | Bagging Classifier with Weights | 0.994154 | 0.901159 | 0.968944 | 0.525362 | 1.00000 | 0.911950 | 0.984227 | 0.666667 |
| 3 | Random Forest | 1.000000 | 0.909339 | 1.000000 | 0.547101 | 1.00000 | 0.949686 | 1.000000 | 0.694253 |
| 4 | Random Forest with Weights | 1.000000 | 0.906612 | 1.000000 | 0.521739 | 1.00000 | 0.966443 | 1.000000 | 0.677647 |
| 5 | Tuned Decision Tree | 0.991231 | 0.884117 | 1.000000 | 0.735507 | 0.95549 | 0.676667 | 0.977238 | 0.704861 |
# creating confusion matrix
make_confusion_matrix(dt_estimator,X_test,y_test)
param_grid = {'n_estimators':[5,7,15,51,101],
'max_features': [0.7,0.8,0.9,1]
}
grid_obj = GridSearchCV(model_bagging_dt, param_grid=param_grid, scoring = 'f1', cv = 5, verbose=2, n_jobs=-1)
grid_obj.fit(X_train, y_train)
# Get the best combination of parameters
bagging_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
bagging_estimator.fit(X_train, y_train)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.18,
1: 0.82},
random_state=1),
max_features=0.9, n_estimators=101, random_state=1)
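Note that with the default `refit=True`, `GridSearchCV` already refits `best_estimator_` on the full training set, so the explicit `fit` call above is redundant (though harmless). A minimal sketch on synthetic data (illustrative names and sizes):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=1)

grid = GridSearchCV(BaggingClassifier(random_state=1),
                    param_grid={'n_estimators': [5, 15]},
                    scoring='f1', cv=3)
grid.fit(X, y)

# best_estimator_ is already refit on all of (X, y)
print(grid.best_params_)
print(grid.best_estimator_.score(X, y))
```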
# checking model performances for this model
scores_bagging_estimator = metrics_score(bagging_estimator,X_train,X_test,y_train,y_test,model_name='Tuned Bagging Classifier')
scores = pd.concat([scores, pd.DataFrame([scores_bagging_estimator])], ignore_index=True)
scores
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 1.000000 | 0.884117 | 1.000000 | 0.702899 | 1.00000 | 0.687943 | 1.000000 | 0.695341 |
| 1 | Bagging Classifier | 0.995323 | 0.912747 | 0.975155 | 0.597826 | 1.00000 | 0.906593 | 0.987421 | 0.720524 |
| 2 | Bagging Classifier with Weights | 0.994154 | 0.901159 | 0.968944 | 0.525362 | 1.00000 | 0.911950 | 0.984227 | 0.666667 |
| 3 | Random Forest | 1.000000 | 0.909339 | 1.000000 | 0.547101 | 1.00000 | 0.949686 | 1.000000 | 0.694253 |
| 4 | Random Forest with Weights | 1.000000 | 0.906612 | 1.000000 | 0.521739 | 1.00000 | 0.966443 | 1.000000 | 0.677647 |
| 5 | Tuned Decision Tree | 0.991231 | 0.884117 | 1.000000 | 0.735507 | 0.95549 | 0.676667 | 0.977238 | 0.704861 |
| 6 | Tuned Bagging Classifier | 1.000000 | 0.920927 | 1.000000 | 0.608696 | 1.00000 | 0.954545 | 1.000000 | 0.743363 |
# creating confusion matrix
make_confusion_matrix(bagging_estimator,X_test,y_test)
# Grid of parameters to choose from
parameters = {
"n_estimators": [110,251,501],
"min_samples_leaf": np.arange(1,6,1),
"max_features": [0.7,0.9,'log2','sqrt'],  # 'auto' (an alias of 'sqrt') was removed in scikit-learn 1.3
"max_samples": [0.7,0.9,None],
}
# Run the grid search
grid_obj = GridSearchCV(model_random_forest_wt, parameters, scoring='f1', cv=5, verbose=2, n_jobs=-1)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
rf_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
rf_estimator.fit(X_train, y_train)
Fitting 5 folds for each of 180 candidates, totalling 900 fits
RandomForestClassifier(class_weight={0: 0.18, 1: 0.82}, max_features=0.7,
min_samples_leaf=3, n_estimators=110, random_state=1)
# checking model performances for this model
scores_rf_estimator = metrics_score(rf_estimator,X_train,X_test,y_train,y_test,model_name='Tuned Random Forest')
scores = pd.concat([scores, pd.DataFrame([scores_rf_estimator])], ignore_index=True)
scores
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Decision Tree | 1.000000 | 0.884117 | 1.000000 | 0.702899 | 1.000000 | 0.687943 | 1.000000 | 0.695341 |
| 1 | Bagging Classifier | 0.995323 | 0.912747 | 0.975155 | 0.597826 | 1.000000 | 0.906593 | 0.987421 | 0.720524 |
| 2 | Bagging Classifier with Weights | 0.994154 | 0.901159 | 0.968944 | 0.525362 | 1.000000 | 0.911950 | 0.984227 | 0.666667 |
| 3 | Random Forest | 1.000000 | 0.909339 | 1.000000 | 0.547101 | 1.000000 | 0.949686 | 1.000000 | 0.694253 |
| 4 | Random Forest with Weights | 1.000000 | 0.906612 | 1.000000 | 0.521739 | 1.000000 | 0.966443 | 1.000000 | 0.677647 |
| 5 | Tuned Decision Tree | 0.991231 | 0.884117 | 1.000000 | 0.735507 | 0.955490 | 0.676667 | 0.977238 | 0.704861 |
| 6 | Tuned Bagging Classifier | 1.000000 | 0.920927 | 1.000000 | 0.608696 | 1.000000 | 0.954545 | 1.000000 | 0.743363 |
| 7 | Tuned Random Forest | 0.990938 | 0.916837 | 0.996894 | 0.688406 | 0.956781 | 0.840708 | 0.976426 | 0.756972 |
# creating confusion matrix
make_confusion_matrix(rf_estimator,X_test,y_test)
# Print scores for all models
scores.sort_values(by=['Recall on test set'], ascending=False)
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 5 | Tuned Decision Tree | 0.991231 | 0.884117 | 1.000000 | 0.735507 | 0.955490 | 0.676667 | 0.977238 | 0.704861 |
| 0 | Decision Tree | 1.000000 | 0.884117 | 1.000000 | 0.702899 | 1.000000 | 0.687943 | 1.000000 | 0.695341 |
| 7 | Tuned Random Forest | 0.990938 | 0.916837 | 0.996894 | 0.688406 | 0.956781 | 0.840708 | 0.976426 | 0.756972 |
| 6 | Tuned Bagging Classifier | 1.000000 | 0.920927 | 1.000000 | 0.608696 | 1.000000 | 0.954545 | 1.000000 | 0.743363 |
| 1 | Bagging Classifier | 0.995323 | 0.912747 | 0.975155 | 0.597826 | 1.000000 | 0.906593 | 0.987421 | 0.720524 |
| 3 | Random Forest | 1.000000 | 0.909339 | 1.000000 | 0.547101 | 1.000000 | 0.949686 | 1.000000 | 0.694253 |
| 2 | Bagging Classifier with Weights | 0.994154 | 0.901159 | 0.968944 | 0.525362 | 1.000000 | 0.911950 | 0.984227 | 0.666667 |
| 4 | Random Forest with Weights | 1.000000 | 0.906612 | 1.000000 | 0.521739 | 1.000000 | 0.966443 | 1.000000 | 0.677647 |
# importance of features in the tree building
impor_fea = pd.DataFrame(dt_estimator.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by="Imp", ascending=False)
impor_fea = impor_fea.reset_index()
impor_fea.head(23)
|   | index | Imp |
|---|---|---|
| 0 | MonthlyIncome | 0.137457 |
| 1 | Age | 0.133593 |
| 2 | DurationOfPitch | 0.117237 |
| 3 | Passport | 0.086212 |
| 4 | CityTier | 0.075115 |
| 5 | NumberOfTrips | 0.061350 |
| 6 | MaritalStatus_Single | 0.056090 |
| 7 | PreferredPropertyStar | 0.055702 |
| 8 | PitchSatisfactionScore | 0.049261 |
| 9 | NumberOfFollowups | 0.036707 |
| 10 | ProductPitched_Deluxe | 0.029724 |
| 11 | Occupation_Large Business | 0.029644 |
| 12 | Gender_Male | 0.023978 |
| 13 | TypeofContact_Self Enquiry | 0.020386 |
| 14 | NumberOfChildrenVisiting | 0.018070 |
| 15 | OwnCar | 0.015857 |
| 16 | ProductPitched_Standard | 0.014287 |
| 17 | Occupation_Small Business | 0.013586 |
| 18 | ProductPitched_Super Deluxe | 0.013234 |
| 19 | NumberOfPersonVisiting | 0.008874 |
| 20 | Occupation_Salaried | 0.002778 |
| 21 | ProductPitched_King | 0.000666 |
| 22 | MaritalStatus_Married | 0.000191 |
plt.figure(figsize=(12,12))
sns.barplot(x='Imp', y='index', data=impor_fea);
MonthlyIncome, Age and DurationOfPitch are the most important features in the Tuned Decision Tree
model_abc = AdaBoostClassifier(random_state=1)
model_abc.fit(X_train,y_train)
AdaBoostClassifier(random_state=1)
# checking model performances for this model
scores_boosting = metrics_score(model_abc,X_train,X_test,y_train,y_test,model_name='AdaBoost Classifier')
scores_boosting = pd.DataFrame(scores_boosting,index=[0])
scores_boosting
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AdaBoost Classifier | 0.845659 | 0.849352 | 0.318323 | 0.32971 | 0.697279 | 0.716535 | 0.4371 | 0.451613 |
# creating confusion matrix
make_confusion_matrix(model_abc,X_test,y_test)
model_gbc = GradientBoostingClassifier(random_state=1)
model_gbc.fit(X_train,y_train)
GradientBoostingClassifier(random_state=1)
# checking model performances for this model
scores_gbc = metrics_score(model_gbc,X_train,X_test,y_train,y_test,model_name='Gradient Boosting Classifier')
scores_boosting = pd.concat([scores_boosting, pd.DataFrame([scores_gbc])], ignore_index=True)  # DataFrame.append was removed in pandas 2.x
scores_boosting
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AdaBoost Classifier | 0.845659 | 0.849352 | 0.318323 | 0.329710 | 0.697279 | 0.716535 | 0.437100 | 0.451613 |
| 1 | Gradient Boosting Classifier | 0.883367 | 0.862986 | 0.436335 | 0.369565 | 0.886435 | 0.790698 | 0.584807 | 0.503704 |
# creating confusion matrix
make_confusion_matrix(model_gbc,X_test,y_test)
model_xgb = XGBClassifier(random_state=1, eval_metric='logloss', use_label_encoder=False)
model_xgb.fit(X_train,y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=8,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1, tree_method='exact',
use_label_encoder=False, validate_parameters=1, verbosity=None)
# checking model performances for this model
scores_xgb = metrics_score(model_xgb,X_train,X_test,y_train,y_test,model_name='XGBoost Classifier')
scores_boosting = pd.concat([scores_boosting, pd.DataFrame([scores_xgb])], ignore_index=True)
scores_boosting
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AdaBoost Classifier | 0.845659 | 0.849352 | 0.318323 | 0.329710 | 0.697279 | 0.716535 | 0.437100 | 0.451613 |
| 1 | Gradient Boosting Classifier | 0.883367 | 0.862986 | 0.436335 | 0.369565 | 0.886435 | 0.790698 | 0.584807 | 0.503704 |
| 2 | XGBoost Classifier | 0.999708 | 0.925699 | 0.998447 | 0.706522 | 1.000000 | 0.874439 | 0.999223 | 0.781563 |
# creating confusion matrix
make_confusion_matrix(model_xgb,X_test,y_test)
Now, we are going to create a Stacking Classifier using the Tuned Decision Tree, AdaBoost Classifier, Tuned Random Forest and the XGBoost Classifier
estimators = [('tdt', dt_estimator),
('abc', model_abc),
('trf', rf_estimator),
('xgb', model_xgb)]
model_stacking = StackingClassifier(estimators=estimators, final_estimator=RandomForestClassifier(random_state=1),n_jobs=-1)
model_stacking.fit(X_train,y_train)
StackingClassifier(estimators=[('tdt',
DecisionTreeClassifier(ccp_alpha=0.00032312936156255617,
class_weight={0: 0.18,
1: 0.82},
random_state=1)),
('abc', AdaBoostClassifier(random_state=1)),
('trf',
RandomForestClassifier(class_weight={0: 0.18,
1: 0.82},
max_features=0.7,
min_samples_leaf=3,
n_estimators=110,
random_state=1)),
('xgb',
XGBClassifier(base_score=0.5, bo...
max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan,
monotone_constraints='()',
n_estimators=100, n_jobs=8,
num_parallel_tree=1,
random_state=1, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1,
subsample=1, tree_method='exact',
use_label_encoder=False,
validate_parameters=1,
verbosity=None))],
final_estimator=RandomForestClassifier(random_state=1),
n_jobs=-1)
# checking model performances for this model
scores_stacking = metrics_score(model_stacking,X_train,X_test,y_train,y_test,model_name='Stacking Classifier')
scores_boosting = pd.concat([scores_boosting, pd.DataFrame([scores_stacking])], ignore_index=True)
scores_boosting
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AdaBoost Classifier | 0.845659 | 0.849352 | 0.318323 | 0.329710 | 0.697279 | 0.716535 | 0.437100 | 0.451613 |
| 1 | Gradient Boosting Classifier | 0.883367 | 0.862986 | 0.436335 | 0.369565 | 0.886435 | 0.790698 | 0.584807 | 0.503704 |
| 2 | XGBoost Classifier | 0.999708 | 0.925699 | 0.998447 | 0.706522 | 1.000000 | 0.874439 | 0.999223 | 0.781563 |
| 3 | Stacking Classifier | 0.999415 | 0.919564 | 1.000000 | 0.757246 | 0.996904 | 0.803846 | 0.998450 | 0.779851 |
# creating confusion matrix
make_confusion_matrix(model_stacking,X_test,y_test)
# Print scores for all models
scores_boosting
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AdaBoost Classifier | 0.845659 | 0.849352 | 0.318323 | 0.329710 | 0.697279 | 0.716535 | 0.437100 | 0.451613 |
| 1 | Gradient Boosting Classifier | 0.883367 | 0.862986 | 0.436335 | 0.369565 | 0.886435 | 0.790698 | 0.584807 | 0.503704 |
| 2 | XGBoost Classifier | 0.999708 | 0.925699 | 0.998447 | 0.706522 | 1.000000 | 0.874439 | 0.999223 | 0.781563 |
| 3 | Stacking Classifier | 0.999415 | 0.919564 | 1.000000 | 0.757246 | 0.996904 | 0.803846 | 0.998450 | 0.779851 |
As before, we use the f1-score as the performance metric, aiming to improve recall without considerably reducing precision.
param_grid = {
"base_estimator":[DecisionTreeClassifier(max_depth=1),
DecisionTreeClassifier(max_depth=2),
DecisionTreeClassifier(max_depth=3)],
"n_estimators": np.arange(20,210,20),
"learning_rate":np.arange(0.2,2,0.2)
}
grid_obj = GridSearchCV(model_abc, param_grid=param_grid, scoring='f1', cv=5, verbose=2, n_jobs=-1)
grid_obj.fit(X_train, y_train)
# Get the best combination of parameters
abc_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
abc_estimator.fit(X_train, y_train)
Fitting 5 folds for each of 270 candidates, totalling 1350 fits
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
learning_rate=1.6, n_estimators=160, random_state=1)
# checking model performances for this model
scores_abc_estimator = metrics_score(abc_estimator,X_train,X_test,y_train,y_test,model_name='Tuned AdaBoost Classifier')
scores_boosting = pd.concat([scores_boosting, pd.DataFrame([scores_abc_estimator])], ignore_index=True)
scores_boosting
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AdaBoost Classifier | 0.845659 | 0.849352 | 0.318323 | 0.329710 | 0.697279 | 0.716535 | 0.437100 | 0.451613 |
| 1 | Gradient Boosting Classifier | 0.883367 | 0.862986 | 0.436335 | 0.369565 | 0.886435 | 0.790698 | 0.584807 | 0.503704 |
| 2 | XGBoost Classifier | 0.999708 | 0.925699 | 0.998447 | 0.706522 | 1.000000 | 0.874439 | 0.999223 | 0.781563 |
| 3 | Stacking Classifier | 0.999415 | 0.919564 | 1.000000 | 0.757246 | 0.996904 | 0.803846 | 0.998450 | 0.779851 |
| 4 | Tuned AdaBoost Classifier | 1.000000 | 0.888207 | 1.000000 | 0.659420 | 1.000000 | 0.722222 | 1.000000 | 0.689394 |
# creating confusion matrix
make_confusion_matrix(abc_estimator,X_test,y_test)
param_grid = {
"n_estimators": np.arange(20,210,20),
"learning_rate":np.arange(0.2,2,0.2),
"subsample":[0.8,1],
"max_features":[0.8,1]
}
grid_obj = GridSearchCV(model_gbc, param_grid=param_grid, scoring='f1', cv=5, verbose=2, n_jobs=-1)
grid_obj.fit(X_train, y_train)
# Get the best combination of parameters
gbc_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
gbc_estimator.fit(X_train, y_train)
Fitting 5 folds for each of 360 candidates, totalling 1800 fits
GradientBoostingClassifier(learning_rate=0.6000000000000001, max_features=0.8,
n_estimators=200, random_state=1, subsample=1)
# checking model performances for this model
scores_gbc_estimator = metrics_score(gbc_estimator,X_train,X_test,y_train,y_test,model_name='Tuned Gradient Boosting Classifier')
scores_boosting = pd.concat([scores_boosting, pd.DataFrame([scores_gbc_estimator])], ignore_index=True)
scores_boosting
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AdaBoost Classifier | 0.845659 | 0.849352 | 0.318323 | 0.329710 | 0.697279 | 0.716535 | 0.437100 | 0.451613 |
| 1 | Gradient Boosting Classifier | 0.883367 | 0.862986 | 0.436335 | 0.369565 | 0.886435 | 0.790698 | 0.584807 | 0.503704 |
| 2 | XGBoost Classifier | 0.999708 | 0.925699 | 0.998447 | 0.706522 | 1.000000 | 0.874439 | 0.999223 | 0.781563 |
| 3 | Stacking Classifier | 0.999415 | 0.919564 | 1.000000 | 0.757246 | 0.996904 | 0.803846 | 0.998450 | 0.779851 |
| 4 | Tuned AdaBoost Classifier | 1.000000 | 0.888207 | 1.000000 | 0.659420 | 1.000000 | 0.722222 | 1.000000 | 0.689394 |
| 5 | Tuned Gradient Boosting Classifier | 0.996200 | 0.890934 | 0.981366 | 0.634058 | 0.998420 | 0.747863 | 0.989820 | 0.686275 |
# creating confusion matrix
make_confusion_matrix(gbc_estimator,X_test,y_test)
param_grid = {
"n_estimators": np.arange(10,100,40),
"scale_pos_weight":[1,2,5],  # a weight of 0 would ignore the positive class entirely
"subsample":[0.5,0.75,1],
"learning_rate":[0.05,0.1,0.2],
"gamma":[0,1,3],
"colsample_bytree":[0.5,0.75,1],
"colsample_bylevel":[0.5,0.75,1]
}
grid_obj = GridSearchCV(model_xgb, param_grid=param_grid, scoring='f1', cv=5, verbose=2, n_jobs=-1)
grid_obj.fit(X_train, y_train)
# Get the best combination of parameters
xgb_estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
xgb_estimator.fit(X_train, y_train)
Fitting 5 folds for each of 2187 candidates, totalling 10935 fits
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.75, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.2, max_delta_step=0,
max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=90, n_jobs=8,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=5, subsample=1, tree_method='exact',
use_label_encoder=False, validate_parameters=1, verbosity=None)
# checking model performances for this model
scores_xgb_estimator = metrics_score(xgb_estimator,X_train,X_test,y_train,y_test,model_name='Tuned XGBoost Classifier')
scores_boosting = pd.concat([scores_boosting, pd.DataFrame([scores_xgb_estimator])], ignore_index=True)
scores_boosting
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AdaBoost Classifier | 0.845659 | 0.849352 | 0.318323 | 0.329710 | 0.697279 | 0.716535 | 0.437100 | 0.451613 |
| 1 | Gradient Boosting Classifier | 0.883367 | 0.862986 | 0.436335 | 0.369565 | 0.886435 | 0.790698 | 0.584807 | 0.503704 |
| 2 | XGBoost Classifier | 0.999708 | 0.925699 | 0.998447 | 0.706522 | 1.000000 | 0.874439 | 0.999223 | 0.781563 |
| 3 | Stacking Classifier | 0.999415 | 0.919564 | 1.000000 | 0.757246 | 0.996904 | 0.803846 | 0.998450 | 0.779851 |
| 4 | Tuned AdaBoost Classifier | 1.000000 | 0.888207 | 1.000000 | 0.659420 | 1.000000 | 0.722222 | 1.000000 | 0.689394 |
| 5 | Tuned Gradient Boosting Classifier | 0.996200 | 0.890934 | 0.981366 | 0.634058 | 0.998420 | 0.747863 | 0.989820 | 0.686275 |
| 6 | Tuned XGBoost Classifier | 0.996200 | 0.927744 | 0.998447 | 0.797101 | 0.981679 | 0.814815 | 0.989992 | 0.805861 |
# creating confusion matrix
make_confusion_matrix(xgb_estimator,X_test,y_test)
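The selected `scale_pos_weight=5` is close to the common heuristic of using the negative-to-positive class ratio as a starting point; with roughly 18% positives (a proportion inferred from the `{0: 0.18, 1: 0.82}` class weights used earlier, not an exact count) that ratio is about 4.6:

```python
# Heuristic starting point for XGBoost's scale_pos_weight:
# count(negative) / count(positive). Proportions below are assumptions.
neg_frac, pos_frac = 0.82, 0.18
print(round(neg_frac / pos_frac, 1))  # -> 4.6
```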
Now, we are going to create a Tuned Stacking Classifier using the Tuned Decision Tree, Tuned AdaBoost Classifier, Tuned Random Forest, Tuned Gradient Boosting Classifier, and Tuned XGBoost Classifier
estimators = [('tabc', abc_estimator),
('trf', rf_estimator),
('tgbc', gbc_estimator),
('txgb', xgb_estimator)]
model_tuned_stacking = StackingClassifier(estimators=estimators, final_estimator=RandomForestClassifier(random_state=1),n_jobs=-1)
model_tuned_stacking.fit(X_train,y_train)
StackingClassifier(estimators=[('tabc',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
learning_rate=1.6,
n_estimators=160,
random_state=1)),
('trf',
RandomForestClassifier(class_weight={0: 0.18,
1: 0.82},
max_features=0.7,
min_samples_leaf=3,
n_estimators=110,
random_state=1)),
('tgbc',
GradientBoostingClassifier(learning_rate=0.600000000...
max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan,
monotone_constraints='()',
n_estimators=90, n_jobs=8,
num_parallel_tree=1,
random_state=1, reg_alpha=0,
reg_lambda=1, scale_pos_weight=5,
subsample=1, tree_method='exact',
use_label_encoder=False,
validate_parameters=1,
verbosity=None))],
final_estimator=RandomForestClassifier(random_state=1),
n_jobs=-1)
# checking model performances for this model
scores_tuned_stacking = metrics_score(model_tuned_stacking,X_train,X_test,y_train,y_test,model_name='Tuned Stacking Classifier')
scores_boosting = pd.concat([scores_boosting, pd.DataFrame([scores_tuned_stacking])], ignore_index=True)
scores_boosting
|   | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AdaBoost Classifier | 0.845659 | 0.849352 | 0.318323 | 0.329710 | 0.697279 | 0.716535 | 0.437100 | 0.451613 |
| 1 | Gradient Boosting Classifier | 0.883367 | 0.862986 | 0.436335 | 0.369565 | 0.886435 | 0.790698 | 0.584807 | 0.503704 |
| 2 | XGBoost Classifier | 0.999708 | 0.925699 | 0.998447 | 0.706522 | 1.000000 | 0.874439 | 0.999223 | 0.781563 |
| 3 | Stacking Classifier | 0.999415 | 0.919564 | 1.000000 | 0.757246 | 0.996904 | 0.803846 | 0.998450 | 0.779851 |
| 4 | Tuned AdaBoost Classifier | 1.000000 | 0.888207 | 1.000000 | 0.659420 | 1.000000 | 0.722222 | 1.000000 | 0.689394 |
| 5 | Tuned Gradient Boosting Classifier | 0.996200 | 0.890934 | 0.981366 | 0.634058 | 0.998420 | 0.747863 | 0.989820 | 0.686275 |
| 6 | Tuned XGBoost Classifier | 0.996200 | 0.927744 | 0.998447 | 0.797101 | 0.981679 | 0.814815 | 0.989992 | 0.805861 |
| 7 | Tuned Stacking Classifier | 0.997077 | 0.921609 | 0.993789 | 0.739130 | 0.990712 | 0.825911 | 0.992248 | 0.780115 |
# Sort all models by recall on the test set
scores_boosting.sort_values(by=['Recall on test set'], ascending=False)
| | Model | Accuracy on training set | Accuracy on test set | Recall on training set | Recall on test set | Precision on training set | Precision on test set | F1 on training set | F1 on test set |
|---|---|---|---|---|---|---|---|---|---|
| 6 | Tuned XGBoost Classifier | 0.996200 | 0.927744 | 0.998447 | 0.797101 | 0.981679 | 0.814815 | 0.989992 | 0.805861 |
| 3 | Stacking Classifier | 0.999415 | 0.919564 | 1.000000 | 0.757246 | 0.996904 | 0.803846 | 0.998450 | 0.779851 |
| 7 | Tuned Stacking Classifier | 0.997077 | 0.921609 | 0.993789 | 0.739130 | 0.990712 | 0.825911 | 0.992248 | 0.780115 |
| 2 | XGBoost Classifier | 0.999708 | 0.925699 | 0.998447 | 0.706522 | 1.000000 | 0.874439 | 0.999223 | 0.781563 |
| 4 | Tuned AdaBoost Classifier | 1.000000 | 0.888207 | 1.000000 | 0.659420 | 1.000000 | 0.722222 | 1.000000 | 0.689394 |
| 5 | Tuned Gradient Boosting Classifier | 0.996200 | 0.890934 | 0.981366 | 0.634058 | 0.998420 | 0.747863 | 0.989820 | 0.686275 |
| 1 | Gradient Boosting Classifier | 0.883367 | 0.862986 | 0.436335 | 0.369565 | 0.886435 | 0.790698 | 0.584807 | 0.503704 |
| 0 | AdaBoost Classifier | 0.845659 | 0.849352 | 0.318323 | 0.329710 | 0.697279 | 0.716535 | 0.437100 | 0.451613 |
# Feature importances learned by the tuned XGBoost classifier
impor_fea = pd.DataFrame(xgb_estimator.feature_importances_, columns=["Imp"], index=X_train.columns).sort_values(by="Imp", ascending=False)
impor_fea = impor_fea.reset_index()
impor_fea.head(23)
| | index | Imp |
|---|---|---|
| 0 | Passport | 0.136961 |
| 1 | MaritalStatus_Single | 0.088133 |
| 2 | ProductPitched_Super Deluxe | 0.079314 |
| 3 | Occupation_Large Business | 0.056758 |
| 4 | ProductPitched_King | 0.054884 |
| 5 | ProductPitched_Deluxe | 0.054508 |
| 6 | CityTier | 0.046962 |
| 7 | PreferredPropertyStar | 0.042297 |
| 8 | NumberOfFollowups | 0.040386 |
| 9 | PitchSatisfactionScore | 0.037852 |
| 10 | DurationOfPitch | 0.036083 |
| 11 | Occupation_Small Business | 0.034659 |
| 12 | Age | 0.034024 |
| 13 | NumberOfTrips | 0.032568 |
| 14 | MonthlyIncome | 0.032438 |
| 15 | ProductPitched_Standard | 0.032164 |
| 16 | Occupation_Salaried | 0.028322 |
| 17 | Gender_Male | 0.027022 |
| 18 | OwnCar | 0.024030 |
| 19 | MaritalStatus_Married | 0.022292 |
| 20 | TypeofContact_Self Enquiry | 0.021688 |
| 21 | NumberOfPersonVisiting | 0.019990 |
| 22 | NumberOfChildrenVisiting | 0.016663 |
plt.figure(figsize=(12,12))
sns.barplot(x='Imp', y='index', data=impor_fea);
Passport, MaritalStatus_Single, and ProductPitched_Super Deluxe are the most important features in the Tuned XGBoost Classifier.
The Visit with us company can use the Tuned XGBoost Classifier to reliably identify the customers who would buy a package, and should target single customers who hold a passport.
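In deployment, the chosen classifier would score prospective customers so that marketing can contact the highest-probability prospects first. A hedged sketch of that workflow on synthetic data (the model and feature matrix here are stand-ins, not the tuned notebook objects):

```python
# Rank prospective customers by predicted purchase probability.
# Synthetic stand-in for the tourism data: ~20% of customers buy the package.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=1)
X_train, X_new, y_train, _ = train_test_split(X, y, test_size=0.25,
                                              random_state=1)
model = GradientBoostingClassifier(random_state=1)
model.fit(X_train, y_train)

# probability that each prospective customer takes the package
proba = model.predict_proba(X_new)[:, 1]
# indices of the ten most promising prospects, highest probability first
top_prospects = proba.argsort()[::-1][:10]
print(top_prospects)
```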